Computational Psychiatry
● Ubiquity Press, Ltd.
Preprints posted in the last 30 days, ranked by how well they match Computational Psychiatry's content profile, based on 12 papers previously published here. The average preprint has a 0.01% match score for this journal, so anything above that is already an above-average fit.
Wei, M.; Zhang, H.; Peng, Q.
Background: Early initiation of substance use is linked to later adverse outcomes, and risk factors come from multiple domains and are shared across substances. In our previous work, traditional time-to-event Cox models identified individual risk factors, but these models are not designed to jointly model multiple outcomes or capture complex non-linear relationships. Multi-task learning (MTL) can leverage shared structure across related outcomes to improve prediction and distinguish common versus substance-specific predictors. However, most MTL studies rely on baseline features and focus on single outcomes, which limits their ability to capture shared risk and temporal changes. Substance use initiation is a time-dependent process that unfolds during development and reflects changing exposures over time. Baseline-only models cannot capture these changes or represent risk dynamics. Discrete-time modeling provides a practical approach by estimating interval-level initiation risk and combining it into cumulative risk at the subject level. By integrating multi-task learning with dynamic modeling, it is possible to share information across outcomes while capturing how risk evolves over time, which may improve prediction performance. Methods: Using the Adolescent Brain Cognitive Development (ABCD) Study (release 5.1), we developed two complementary MTL frameworks to predict initiation of alcohol, nicotine, cannabis, and any substance use. A baseline MTL model predicted fixed-horizon (48-month) initiation using one record per participant, while a dynamic discrete-time MTL model incorporated longitudinal interval data to model time-varying risk. Both models used multi-domain environmental exposures, core covariates, and polygenic risk scores (PRS). Performance was evaluated on a held-out test set using AUROC, PR-AUC, and calibration metrics, and compared with single-task logistic regression (LR). 
Feature importance was assessed using permutation importance and compared with Cox proportional hazards models. Results: MTL showed comparable or improved performance relative to LR, with larger gains for low-prevalence outcomes (cannabis and nicotine). Incorporating longitudinal information led to consistent improvements across all outcomes. Dynamic models increased AUROC by +0.044 to +0.062 for MTL and +0.050 to +0.084 for LR, indicating that temporal information was the primary driver of performance gains. Feature importance analyses showed modest overlap across methods, with higher agreement between dynamic MTL and Cox models than static MTL. A small set of features, including externalizing behavior, parental monitoring, and developmental factors, was consistently identified across all approaches. Conclusions: Dynamic multi-task learning improves the prediction of substance use initiation by leveraging longitudinal structure and shared information across outcomes. While MTL provides additional gains, incorporating time-varying information is the dominant factor for improving performance. Combining baseline and dynamic frameworks offers a comprehensive strategy for identifying robust risk factors and modeling adolescent substance use initiation.
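The discrete-time approach described in this abstract, estimating interval-level initiation risk and combining it into cumulative subject-level risk, reduces to a standard survival identity. A minimal sketch (the hazard values are illustrative, not figures from the study):

```python
def cumulative_risk(interval_hazards):
    """Combine per-interval initiation hazards h_t into subject-level
    cumulative risk: P(initiate by T) = 1 - prod_t (1 - h_t)."""
    survival = 1.0
    for h in interval_hazards:
        survival *= (1.0 - h)  # probability of surviving interval t
    return 1.0 - survival

# Example: three follow-up intervals with rising interval-level hazard
print(round(cumulative_risk([0.05, 0.10, 0.20]), 4))  # → 0.316
```

In a dynamic model of this kind, each interval's hazard would come from the fitted model given the time-varying covariates for that interval; the identity above then aggregates them per participant.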
Shi, Z.; Youngstrom, E. A.; Liu, Y.; Youngstrom, J. K.; Findling, R. L.
Pediatric bipolar disorder is challenging to diagnose accurately due to symptom heterogeneity. More standardized and data-driven approaches are needed to enhance diagnostic reliability. We evaluated a clinical decision tool (nomogram), statistical methods (logistic regression, LASSO), machine learning models (support vector machine, random forest, k-nearest neighbors, extreme gradient boosting), and a deep learning model (multilayer perceptron) for pediatric bipolar disorder prediction across two datasets collected in academic (N=550) and community (N=511) clinical settings. We compared three modeling strategies: cross-dataset validation, cross-dataset with interaction terms, and mixed-dataset. We assessed model performance using discrimination ability, calibration, and predictor importance ranking. In the baseline cross-dataset approach, all models showed good internal discrimination in the academic dataset, but external discrimination in the community dataset declined substantially. Interaction-enhanced models slightly improved internal discrimination but not external performance or calibration. Recalibration prominently improved cross-dataset calibration without compromising discrimination, indicating that transportability problems were largely driven by probability scaling. Models trained on mixed datasets exhibited much stronger external discrimination and calibration. Across models and training strategies, family risk and PGBI-10M were consistently ranked as the most important predictors. Predictive models for pediatric bipolar disorder showed strong internal performance but limited cross-setting generalizability due to dataset shift and miscalibration. Increasing model complexity did not improve external performance, whereas training on pooled data substantially improved both discrimination and calibration. 
Findings suggest that sampling diversity, rather than model complexity, is more valuable for developing clinically useful and generalizable psychiatric prediction models, underscoring the importance of open and collaborative datasets.
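The recalibration finding in this abstract (calibration improves while discrimination is untouched) can be illustrated with a minimal, simplified sketch: intercept-only recalibration shifts all predictions on the logit scale so their average matches the observed event rate. This is a stand-in for illustration, not the authors' procedure (which may also refit a slope):

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def recalibrate(pred_probs, observed_rate):
    """Intercept-only recalibration ('calibration-in-the-large'): shift
    every prediction by a constant on the logit scale so the mean logit
    matches the observed event rate. The shift is monotone, so the
    ranking of patients (and hence discrimination/AUROC) is unchanged."""
    mean_logit = sum(logit(p) for p in pred_probs) / len(pred_probs)
    offset = logit(observed_rate) - mean_logit
    return [1.0 / (1.0 + math.exp(-(logit(p) + offset)))
            for p in pred_probs]

preds = [0.2, 0.4, 0.6, 0.8]          # hypothetical model outputs
adjusted = recalibrate(preds, 0.25)   # external setting has 25% prevalence
```

Because only the intercept moves, a model that ranks cases well but over- or under-shoots probabilities in a new setting can be fixed without retraining, which is consistent with the abstract's conclusion that transportability problems were largely a matter of probability scaling.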
Kizilaslan, B.; Mehlum, L.
Purpose: Suicide and self-harm are major public health concerns characterized by substantial clinical and psychosocial heterogeneity. While latent class analysis has been used to identify subgroups of people with suicidal behavior, the extent to which such population-level phenotyping complements explainable artificial intelligence-based classification models remains unclear. Methods: We applied latent class analysis to a cross-sectional, publicly available dataset of 1000 individuals presenting with self-harm and suicide-related behaviors at Colombo South Teaching Hospital, Kalubowila, Sri Lanka. Sociodemographic, psychosocial, and clinical variables were used to identify latent subgroups. Class characteristics and suicide prevalence were examined and compared with variable importance patterns reported in a previously published explainable artificial intelligence (XAI)-based suicide classification study using the same dataset. Results: Four latent classes were identified. Two classes exhibited very high suicide prevalence (91.2% [95% CI: 87.7-93.8] and 99.0% [95% CI: 96.4-99.7]), whereas two classes showed low prevalence (<1%). The two high-prevalence classes differed markedly in lifetime psychiatric hospitalization history, with one class showing a 100% prevalence of prior hospitalization and the other substantially lower hospitalization rates. These patterns partially aligned with, and extended beyond, variable importance findings from the XAI-based model. Conclusion: Latent class analysis identified distinct subgroups with substantially different suicide prevalence and clinical profiles, underscoring the heterogeneity of individuals presenting with self-harm. Comparison with XAI-based suicide classification model findings suggests that unsupervised phenotyping and supervised classification provide complementary perspectives, offering population-level context that may enhance the interpretability of suicide assessment frameworks. 
Keywords: suicide; self-harm; latent class analysis; explainable artificial intelligence; machine learning
Flathers, M.; Nguyen, P. A. H.; Herpertz, J.; Granof, M.; Ryan, S. J.; Wentworth, L.; Moutier, C. Y.; Torous, J.
Background: Millions of people use language models to discuss mental health concerns, including suicidal ideation, but limited frameworks exist for evaluating whether these systems respond safely. Benchmarking, the practice of administering standardized assessments to language models, offers direct parallels to clinical competency evaluation, yet few clinicians are involved in designing, validating, or interpreting these assessments. Aims: To introduce mental health professionals to benchmarking language models by administering a validated clinical instrument and demonstrating how configuration decisions, measurement limitations, and scoring context affect result interpretation. Method: We administered the Suicide Intervention Response Inventory (SIRI-2) programmatically to nine commercially available language models from three providers. Each item was presented 60 times per model (three prompt variants × two temperature settings × 10 repetitions), yielding 27,000 model responses compared against point-in-time expert consensus. Results: Total scores ranged from 19.5 to 84.0 (expert panel baseline: 32.5). Prompt design alone shifted individual model scores by as much as the difference between trained and untrained human groups. The best-performing model approached the instrument's measurement floor. All nine models consistently overrated clinically inappropriate responses that sounded supportive. Conclusions: A single benchmark score can support markedly different claims depending on the assumed standard of clinical behavior, the instrument's remaining measurement range, and the configuration that produced the result. The skills required to make these distinctions must become core competencies. Benchmark results are increasingly used to support claims about mental health safety that may not be accurate, making it necessary to close the gap between clinical measurement and AI evaluation. 
Plain Language Summary: AI chatbots like ChatGPT, Claude, and Gemini are increasingly used by millions of people to discuss mental health problems, including thoughts of suicide. To assess whether these systems handle such conversations safely, researchers give them standardized tests called benchmarks and compare their answers to those of human experts. These scores are already used to argue AI systems are ready for clinical use. This study gave a well-established test of suicide response skills to nine AI models from three major companies under varying conditions. We changed how much instruction the AI received and how much randomness was built into its responses, then measured whether the scores changed. The same AI model could score like a trained crisis counselor under one set of conditions and like an untrained undergraduate under another, depending on choices the person running the test made. Every model also made the same kind of mistake: responses that sounded warm and caring were rated as appropriate, even when experts had judged them to be clinically problematic. The highest-scoring model performed so well that the test could no longer measure whether it was truly skilled or had simply exceeded the test's range. These findings show that a single score can be misleading without knowing how the test was run, whether it can still distinguish strong from weak performance, and whether it matches what the AI is used for. Mental health professionals routinely make these judgments about clinical assessments and are well positioned to bring that expertise to AI evaluation.
Vloeberghs, R.; Tuerlinckx, F.; Urai, A. E.; Desender, K.
A widely used framework for studying the computational mechanisms of decision making is the Drift Diffusion Model (DDM). To account for the presence of both fast and slow errors in empirical data, the DDM incorporates across-trial variability in parameters such as the drift rate and the starting point. Although these variability parameters enable the model to reproduce both fast and slow errors, they rely on the assumption that each parameter is independently sampled on every trial. As a result, the DDM effectively predicts that errors, whether fast or slow, occur randomly over time. However, this assumption is violated in empirical data, where error responses are often temporally clustered. To address this limitation, we introduce the autocorrelated DDM, in which trial-to-trial fluctuations in drift rate, starting point, and boundary evolve according to first-order autoregressive (AR(1)) processes. Using simulations, we demonstrate that, unlike the across-trial variability DDM, the autocorrelated DDM naturally accounts for temporal clustering of errors. We further show that model parameters can be reliably recovered using Amortized Bayesian Inference, even with as few as 500 trials. Finally, fits to empirical data indicate that the autocorrelated DDM provides the best account of error clustering, highlighting that computational parameters fluctuate over time despite typically being estimated as fixed across trials.
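The AR(1)-drift mechanism described in this abstract can be sketched in a toy simulation (all parameter values and the Euler discretization are illustrative, not the authors' fits): when the drift rate fluctuates autoregressively across trials, runs of low drift produce temporally clustered errors rather than errors scattered at random.

```python
import random

def simulate_ar1_ddm(n_trials, v0=1.0, phi=0.8, sigma_v=0.5,
                     a=1.0, dt=0.01, noise_sd=1.0, seed=0):
    """Simulate choices from a diffusion model whose drift rate follows
    a first-order autoregressive (AR(1)) process across trials, so that
    successive trials have correlated drift (and errors cluster)."""
    rng = random.Random(seed)
    choices = []
    v = v0
    for _ in range(n_trials):
        # AR(1) update: drift reverts toward its mean v0 with
        # autocorrelation phi plus Gaussian innovation noise
        v = v0 + phi * (v - v0) + rng.gauss(0.0, sigma_v)
        # Euler-discretized within-trial evidence accumulation
        x = 0.0
        while abs(x) < a:
            x += v * dt + rng.gauss(0.0, noise_sd) * dt ** 0.5
        choices.append(1 if x >= a else -1)  # 1 = upper (correct) boundary
    return choices
```

Setting `phi=0` recovers the standard independently-sampled across-trial variability assumption, which is a convenient way to see the contrast the authors draw.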
Trivedi, S.; Simons, N. W.; Tyagi, A.; Ramaswamy, A.; Nadkarni, G. N.; Charney, A. W.
Background: Large language models (LLMs) are increasingly used in mental health contexts, yet their detection of suicidal ideation is inconsistent, raising patient safety concerns. Objective: To evaluate whether an independent safety monitoring system improves detection of suicide risk compared with native LLM safeguards. Methods: We conducted a cross-sectional evaluation using 224 paired suicide-related clinical vignettes presented in a single-turn format under two conditions (with and without structured clinical information). Native LLM safeguard responses were compared with an independent supervisory safety architecture with asynchronous monitoring. The primary outcome was detection of suicide risk requiring intervention. Results: The supervisory system detected suicide risk in 205 of 224 evaluations (91.5%) versus 41 of 224 (18.3%) for native LLM safeguards. Among 168 discordant evaluations, 166 favored the supervisory system and 2 favored the LLM (matched odds ratio ≈83.0). Both systems detected risk in 39 evaluations, and neither in 17. Detection was highest in scenarios with explicit suicidal ideation and lower in more ambiguous presentations. Conclusions: Native LLM safeguards frequently failed to detect suicide risk in this structured evaluation. An independent monitoring approach substantially improved detection, supporting the role of external safety systems in high-risk mental health applications of LLMs.
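The matched odds ratio of approximately 83 follows directly from the discordant-pair counts reported in this abstract (a McNemar-style estimate for paired designs, where concordant pairs drop out):

```python
def matched_odds_ratio(favor_a, favor_b):
    """McNemar-style matched odds ratio for a paired design: the ratio
    of discordant pairs favoring system A to those favoring system B.
    Concordant pairs (both detect, or neither) do not enter the ratio."""
    return favor_a / favor_b

# Discordant evaluations from the abstract: 166 favored the supervisory
# system, 2 favored the native LLM safeguards.
print(matched_odds_ratio(166, 2))  # → 83.0
```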
Likar, M.; Brezoczki, B.; Vekony, T.; Simor, P.; Nemeth, D.
Mind wandering has been linked to a wide range of psychiatric conditions, yet most studies have examined these associations in isolation. Given the substantial comorbidity across the psychopathological spectrum, it remains unclear whether elevated mind wandering reflects a general marker of psychopathology or a more specific attentional-control deficit shared across symptom dimensions. To address this, we adopted a dimensional, transdiagnostic approach in a non-clinical sample (N = 376), simultaneously modeling seven symptom dimensions: ADHD, depression, obsessive-compulsive tendencies, schizotypy, autistic traits, hypomania, and eating disorder symptoms. At the bivariate level, mind wandering correlated positively with all symptom dimensions. However, when the substantial shared variance across dimensions was accounted for in both frequentist and Bayesian multivariate regression models, only ADHD symptoms emerged as a unique predictor (β = 0.53; BF₁₀ > 1000), with all remaining predictors yielding negligible unique contributions and Bayes factors supporting the null hypothesis. These findings suggest that previously reported associations between mind wandering and diverse psychopathological symptom dimensions largely reflect a shared liability with ADHD-related attentional dysregulation, rather than disorder-specific mechanisms. This positions mind wandering as a marker of attentional dysregulation more closely tied to ADHD symptomatology than to general psychopathological burden.
Ferreira, C.; Lim, A.
Background: AI-powered cognitive behavioral therapy (CBT) chatbots represent a scalable approach to addressing the global mental health treatment gap. However, causal evidence on their population-level effectiveness in low- and middle-income countries (LMICs) remains limited, and patient perspectives on acceptability and engagement are critical determinants of sustained use. Brazil's Estratégia de Saúde da Família (ESF) deployed an AI-powered CBT chatbot, Saúde Mental Digital (SMD), to registered patients aged 18 and older at participating primary care units, with eligibility determined by a composite vulnerability score exceeding a predetermined threshold. Objective: To estimate the causal effect of AI-powered CBT chatbot access on anxiety and depressive symptoms among primary care patients in Minas Gerais, Brazil, leveraging the eligibility score threshold as an exogenous source of variation. Methods: We conducted a fuzzy regression discontinuity design (fuzzy RDD) study using linked administrative and clinical data from 312 ESF primary care units across Minas Gerais (N = 43,287 patients; January 2022 to December 2024). The running variable was the composite vulnerability score, with a threshold of 60 points determining chatbot eligibility. The primary outcome was the 12-week change in the Patient Health Questionnaire Anxiety and Depression Scale (PHQ-ADS) composite score. Two-stage least squares (2SLS) estimation was used with local polynomial regression and triangular kernel weighting. Bandwidth selection followed the Calonico, Cattaneo, and Titiunik (CCT) optimal procedure. Results: The fuzzy RDD estimated a local average treatment effect (LATE) of -4.73 points (95% CI: -6.91 to -2.55; p < 0.001) on the PHQ-ADS composite score at the eligibility threshold, indicating clinically meaningful symptom reduction among compliers. First-stage estimates confirmed a strong 31.2 percentage-point jump in chatbot uptake at the threshold (F statistic = 127.4). Subgroup analyses revealed larger treatment effects among patients in rural municipalities (-6.18; 95% CI: -9.02 to -3.34), those with lower educational attainment (-5.82; 95% CI: -8.44 to -3.20), and women (-5.37; 95% CI: -7.61 to -3.13). McCrary density tests confirmed no evidence of running-variable manipulation (p = 0.67). Results were robust across alternative bandwidths, polynomial orders, and kernel specifications. Conclusions: AI-powered CBT chatbot access causally reduces anxiety and depressive symptoms among primary care patients near the eligibility threshold in Brazil, with particularly pronounced benefits for rural, less educated, and female populations. These findings provide quasi-experimental evidence supporting the scalable deployment of AI-powered CBT tools within public primary care systems in LMICs, while underscoring the importance of incorporating patient perspectives on acceptability to maximize engagement and sustained therapeutic benefit.
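The fuzzy RDD logic in this abstract can be sketched as a local Wald estimator: the jump in the outcome at the cutoff divided by the jump in treatment uptake, with triangular kernel weights. This is a deliberate simplification (the study uses local polynomial 2SLS with CCT bandwidth selection), and the data below are hypothetical:

```python
def triangular_weight(score, cutoff, bandwidth):
    """Triangular kernel: weight 1 at the cutoff, 0 at the bandwidth edge."""
    return max(0.0, 1.0 - abs(score - cutoff) / bandwidth)

def fuzzy_rdd_wald(scores, treated, outcomes, cutoff, bandwidth):
    """Local Wald estimate of the LATE at the cutoff: jump in mean
    outcome divided by jump in treatment uptake, each computed with
    triangular-kernel weighted means on either side of the cutoff."""
    def wmean(vals, above):
        num = den = 0.0
        for s, v in zip(scores, vals):
            if (s >= cutoff) == above:
                w = triangular_weight(s, cutoff, bandwidth)
                num += w * v
                den += w
        return num / den
    jump_y = wmean(outcomes, True) - wmean(outcomes, False)
    jump_d = wmean(treated, True) - wmean(treated, False)
    return jump_y / jump_d  # rescales the outcome jump by first-stage uptake

# Hypothetical toy data around a cutoff of 60
scores = [58, 59, 61, 62]
treated = [0, 0, 1, 1]                 # sharp uptake jump at the cutoff
outcomes = [0.0, 0.0, -4.0, -4.0]      # symptom change scores
est = fuzzy_rdd_wald(scores, treated, outcomes, cutoff=60, bandwidth=5)
```

Dividing the outcome jump by the uptake jump is what makes the design "fuzzy": when only a fraction of eligible patients actually use the chatbot, the reduced-form jump understates the effect among compliers, and the first stage corrects for that.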
Petri, L. E.; Lee, S. A.; Shire, D.; Leonard, S.; Behnke, A.; Greaney, J.; Alexander, L.; Almeida, D. M.; Picard, M.; Trumpff, C.
The present study analyzes the impact of naturalistic stress and emotions on saliva cell-free mitochondrial DNA (cf-mtDNA) in daily life across two independent cohorts with different temporal resolutions. Study 1 examined the interaction between daily stress and major depressive disorder (MDD) on cf-mtDNA in young adults (n = 18; 8 MDD, 10 controls) across four days. For individuals with MDD, stress exposure was associated with a 68% reduction in cf-mtDNA. A higher number or greater severity of stressors also reduced cf-mtDNA by 24 to 27%. Study 2 extended this framework by implementing a finer temporal resolution, measuring saliva and affective states every hour, up to 20 times per day for 2 days (n = 25). Negative emotions, including stress and frustration, were associated with reductions in cf-mtDNA of 15%, whereas positive emotions, such as happiness and calm, predicted increases of up to 28%. The strength and direction of the effects were person- and context-dependent. These findings suggest that cf-mtDNA does not exhibit a uniform stress response in daily life. Instead, it reflects dynamic signaling shaped by timing, emotional context, and diagnostic status. Accordingly, cf-mtDNA should be conceptualized as a dynamic biobehavioral signal rather than a static indicator of between-person differences.
Whitfield, J.; Goh, A.
Background: AI-powered cognitive behavioural therapy (AI-CBT) tools hold significant promise for addressing the global mental health treatment gap, yet sustained user engagement remains critically low. While patient attitudes and experiential factors have been qualitatively documented, the psychological mechanisms through which AI literacy translates into long-term engagement remain poorly understood. Existing systematic evidence highlights trust, perceived therapeutic alliance, and stigma as salient themes, but no large-scale quantitative study has modelled these as a mediated pathway. Objective: This study aimed to (1) examine whether trust in AI systems and perceived therapeutic alliance mediate the relationship between AI literacy and sustained AI-CBT engagement, and (2) determine whether mental health stigma moderates these mediated pathways. Methods: A cross-sectional national online survey was conducted in the United Kingdom (N = 1,247). Eligible adults (18+) with a history of anxiety or depression who had used an AI-CBT tool in the preceding 12 months were recruited via stratified random sampling. Structural equation modelling (SEM) with moderated mediation was conducted in R (lavaan 0.6-17). Moderated mediation was evaluated using the PROCESS macro framework adapted for SEM, with 5,000 bootstrap replications for bias-corrected confidence intervals. Model fit was assessed using CFI, TLI, RMSEA, and SRMR indices. Results: The final SEM demonstrated excellent fit (CFI = 0.967, TLI = 0.959, RMSEA = 0.043 [90% CI: 0.036-0.051], SRMR = 0.052). AI literacy exerted a significant indirect effect on sustained engagement through trust in AI (β = 0.213, SE = 0.031, p < .001) and perceived therapeutic alliance (β = 0.187, SE = 0.028, p < .001). Mental health stigma significantly moderated the trust→engagement pathway (ΔR² = 0.042, p = .003), with the indirect effect being stronger among individuals with lower stigma scores. 
The total indirect effect accounted for 58.4% of the total effect of AI literacy on engagement. Conclusions: AI literacy promotes sustained AI-CBT engagement primarily through its effects on trust and perceived therapeutic alliance, pathways that are attenuated by mental health stigma. These findings underscore the need for stigma-reduction interventions and AI literacy programmes as implementation strategies. Findings have direct implications for the design and deployment of AI-CBT tools across UK NHS digital mental health services.
Diekmann, N.; Lissek, S.; Uengoer, M.; Cheng, S.
The progress of learning is usually quantified by averaging responses across participants and/or multiple trials within a block. However, such approaches obscure the trial-by-trial progress of learning, which has recently been shown to express a rich variety of dynamics. An alternative approach that does not suffer from this problem is the detection and analysis of points of behavioral change, i.e., change-point analysis. Using change-point analysis, we reanalyzed data from human participants in different predictive learning tasks in which learned contingencies underwent reversal. We find that responses of individual participants were more accurately characterized by behavioral change points than by the average learning curve. Importantly, change points significantly shifted to later trials during reversal learning, indicating that reversal learning is more difficult than initial learning. In a computational model based on deep reinforcement learning, we show that the change-point shift required the replay of previous experiences, which in turn depends on the hippocampus. This finding is consistent with studies showing that lesions of the hippocampus yield faster reversal learning. In summary, we reaffirm the importance of analyzing single-participant responses, show that phenomenological learning rates are slower during reversal learning, and provide a theoretical account for this difference.
Navarro, V. M.; Brugger, S.; Wolpe, N.; Harding, J.; Fletcher, P.; Teufel, C.
Predictive coding has influenced many conceptual accounts of delusions, the bizarre and distressing beliefs that accompany a range of neuropsychiatric conditions. However, these explanations remain incomplete and have rarely been tested directly using formal modelling. Here, we present a formal account of delusional beliefs based on hybrid predictive coding, which sheds light on the computational mechanisms underpinning the core features of delusions: thematic recurrence and imperviousness to contradictory evidence. In simulation experiments, we demonstrate that a combination of contextually inadequate initialisation of beliefs and excessive certainty (a hallmark of psychosis) triggers a reorganisation of the generative model relating observed events to hidden causes. This reorganisation enables the maintenance of delusional beliefs that are thematically stable, internally consistent with external inputs, and impervious to contradictory evidence, all without an increase in prediction error. Overall, our results suggest that delusions may arise not from faulty inference, as previously argued, but as an adaptive consequence of generative models learned under atypical conditions. These findings provide mechanistic insights into the computations underpinning delusions and have important implications for a novel therapeutic strategy in terms of re-training generative models.
Givon-Schaham, N.; Shalev, N.
Adult ADHD is increasingly recognized across the lifespan, yet the psychometric equivalence of the Adult ADHD Self-Report Scale (ASRS) remains unverified for older populations. This study examined age-related Differential Item Functioning (DIF) in 600 adults (n = 100 per decade, ages 20-80) who completed the 18-item ASRS. Using a bi-factor Graded Response Model, we extracted latent ADHD trait scores (ωH = .895) and assessed DIF via ordinal logistic regression with adaptive age modeling. Five of 18 items exhibited significant uniform DIF. At equivalent latent severity, older adults were less likely to endorse hyperactivity symptoms in Part A (fidgeting, feeling "driven by a motor") but more likely to endorse specific symptoms in Part B (careless mistakes, misplacing items, interrupting). From ages 20 to 80, expected Part A scores decreased by 1.36 points (~0.27 per decade), while Part B scores increased by 1.15 points (~0.23 per decade). These findings indicate a phenotypic redistribution of ADHD symptoms as individuals age. Because the 6-item Part A screener serves as the primary clinical gatekeeper, its concentration of negative DIF suggests standard screening practice may systematically underestimate ADHD severity in older adults. We recommend using the full 18-item ASRS when screening older populations and suggest that developing age-adjusted norms would improve diagnostic accuracy.
Donegan, M. L.; Srivastava, A.; Peake, E.; Swirbul, M.; Ungashe, A.; Rodio, M. J.; Tal, N.; Margolin, G.; Benders-Hadi, N.; Padmanabhan, A.
The goal of this work was to leverage a large corpus of text-based psychotherapy data to create novel machine learning algorithms that can identify suicide risk in asynchronous text therapy. Advances in natural language processing and machine learning have allowed us to include novel data sources as well as use encoding models that can represent context. Our models utilize advanced natural language processing techniques, including fine-tuned transformer models like RoBERTa, to classify risk. Subsequent model versions incorporated non-text data such as demographic features and census-derived social determinants of health to improve equitable and culturally responsive risk assessment, as well as multiclass models that can identify tiered levels of risk. All new models demonstrated significant improvements over our previous model. Our final version, a multiclass model, provides a tiered system that classifies risk as "no risk," "moderate," or "severe" (weighted F1 of 0.85). This tiered approach enhances clinical utility by allowing providers to quickly prioritize the most urgent cases, ensuring more accurate and timely intervention for clients in need.
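The weighted F1 metric reported for the tiered classifier is the support-weighted mean of per-class F1 scores, which keeps a rare "severe" class from being drowned out by accuracy on the majority class. A minimal sketch with hypothetical labels (not the authors' evaluation pipeline):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores for a multiclass
    classifier: each class's F1 is weighted by how often it occurs."""
    support = Counter(y_true)
    total = 0.0
    for c in set(y_true):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        total += support[c] * f1
    return total / len(y_true)

# Hypothetical tiered-risk labels
truth = ["no risk", "no risk", "moderate", "severe"]
preds = ["no risk", "moderate", "moderate", "severe"]
score = weighted_f1(truth, preds)
```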
Imtiaz, Z.; Kopell, B. H.; Olson, S.; Saez, I.; Song, H. N.; Mayberg, H. S.; Choi, K. S.; Waters, A. C.; Figee, M.; Smith, A. H.
Background: Deep brain stimulation (DBS) of the anterior limb of the internal capsule (ALIC) is an effective treatment for severe obsessive-compulsive disorder (OCD). Identifying brain readouts of positive response may guide further DBS optimization. Methods: We measured local field potential (LFP) changes from bilateral DBS leads in 10 OCD patients implanted at a uniform tractographic network target derived from prior DBS responders. We consistently stimulated dorsal lead contacts in the ALIC white matter, while recording LFP from the ventral lead contacts in grey matter of the anterior globus pallidus externus (GPe), a key node in the basal ganglia non-motor indirect pathway. Results: After six months of DBS, OCD symptoms decreased on average by 40% across subjects, along with a significant decrease in alpha activity across both hemispheres. Only one patient did not have an improvement of symptoms, and this was also the only patient to never exhibit an alpha decrease in either hemisphere. Conclusions: Our findings suggest that therapeutic ALIC DBS coincides with a stable decrease in limbic-cognitive GPe alpha power, which should be further investigated as a potential biomarker of sustained response.
Peck, F. C.; Walsh, C. R.; Truong, H.; Pochon, J.-B.; Enriquez, K.; Bearden, C. E.; Loo, S.; Bilder, R.; Lenartowicz, A.; Rissman, J.
Working memory (WM) supports the temporary maintenance of goal-relevant information and is disrupted across many neuropsychiatric disorders. We examined whether scalp electroencephalography (EEG) data features beyond spectral power, including waveform shape, broadband spectral structure, and signal complexity, provide complementary information for predicting cognitive and clinical outcomes. EEG was recorded from 200 adults spanning a broad range of neuropsychiatric symptom severity while they completed three WM task paradigms: Sternberg spatial WM (SWM), delayed face recognition (DFR), and dot pattern expectancy (DPX). Separate machine learning models were trained on EEG features from the encoding, delay, and probe phases of each task to predict participants' task accuracy, reaction time (RT) variability, WM capacity, and psychopathology scores (Brief Psychiatric Rating Scale). A split-half analytic framework was used, with cross-validated model development in an exploratory dataset (N=100) and evaluation of statistically significant models in a held-out validation dataset (N=100). In the exploratory dataset, SWM task data best predicted WM capacity, DPX task data predicted RT variability, and DFR task data predicted psychopathology, suggesting that these three WM paradigms engage distinct neural processes relevant to different outcomes. No models reliably predicted task accuracy. Models incorporating features beyond spectral power generally outperformed power-only models, and task-derived features outperformed resting-state-derived features. However, only those models predicting WM capacity and RT variability generalized to the validation dataset; models predicting psychopathology did not. These findings demonstrate functional heterogeneity across WM paradigms, show that complementary EEG features enhance predictive modeling, and highlight the importance of rigorous validation for identifying robust brain-behavior relationships.
Bakstein, E.; Kudelka, J.; Schneider, J.; Slovakova, A.; Fialova, M.; Ihln, M.; Furstova, P.; Hlinka, J.; Spaniel, F.
BACKGROUND: Predicting long-term outcomes in first-episode schizophrenia (FES) remains difficult, despite being especially important early in the illness, when timely intervention is most critical. Many potential predictors have been studied, but few are reliable enough to guide early treatment decisions. It also remains unclear how much data from the initial phase of illness is required to improve prognostic accuracy. METHODS: We analysed 68 patients with FES assessed at baseline (V1; mean 0.5 years post-onset, YPO), one-year follow-up (V2; mean 1.2 YPO), and outcome (V3; mean 4.9 YPO). We trained elastic-net models to predict three V3 outcomes, namely negative symptoms (PANSS Negative factor; Wallwork/Fortgang), global functioning (GAF), and quality of life (WHOQOL-BREF psychological domain), using either 23 V1 predictors alone or V1 predictors plus V2 data (43 predictors). Performance was evaluated with nested cross-validation on held-out data. RESULTS: Using predictors from the first year (V1+V2), we achieved statistically significant out-of-sample prediction for all three V3 outcomes: PANSS Negative factor (Wallwork/Fortgang), R2=0.22, driven mainly by log(DUP), PANSS Negative at V1/V2, and PANSS Disorganized at V2; WHOQOL-BREF Psychological Health, R2=0.22, driven mainly by WHOQOL Psychological Health at V2 and GAF at V2; and GAF, R2=0.14, driven mainly by GAF at V2, PANSS Positive at V2, WHOQOL Psychological Health at V2, and hospitalization burden (before V1 and between V1 and V2). With baseline-only predictors (V1), only PANSS Negative showed meaningful predictive power (R2=0.15); GAF and WHOQOL-BREF did not outperform the intercept-only baseline. CONCLUSION: In FES, long-term functioning (GAF) and quality of life (WHOQOL-BREF) cannot be predicted well from first-episode (V1) measures; at least one additional year of follow-up is needed, implying that these outcomes are driven by changes after onset that V1 misses.
Negative symptoms differ: they are comparatively stable after initial antipsychotic treatment, and duration of untreated psychosis is their strongest predictor beyond baseline severity, consistent with early biology and treatment timing shaping their level and persistence. These contrasting patterns indicate different outcome phenotypes.
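The nested cross-validation scheme this abstract describes (an inner loop to tune the elastic-net penalty, an outer loop to estimate out-of-sample R2) can be sketched as follows. This is a schematic on synthetic data only; the penalty grid, fold counts, and feature construction are assumptions for illustration, not the authors' settings.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic stand-in for 68 patients with 43 V1+V2 predictors and one V3 outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(68, 43))
y = 0.8 * X[:, 0] + rng.normal(scale=1.0, size=68)

# Inner loop tunes the elastic-net hyperparameters; outer loop scores held-out folds.
inner = KFold(n_splits=5, shuffle=True, random_state=0)
outer = KFold(n_splits=5, shuffle=True, random_state=1)
grid = GridSearchCV(
    ElasticNet(max_iter=10_000),
    param_grid={"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.2, 0.5, 0.8]},
    cv=inner,
    scoring="r2",
)
# One out-of-sample R2 per outer fold; their mean estimates generalization.
scores = cross_val_score(grid, X, y, cv=outer, scoring="r2")
```

Because hyperparameters are re-tuned inside each outer training fold, the outer-fold R2 values are not contaminated by the tuning step.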
Zhang, S.; Wang, H.; Mendoza, R. B.
Resource sharing is a fundamental form of social exchange underlying the formation and maintenance of social bonds in humans and other species. While reciprocity has long been proposed as a key mechanism in group interactions, the dynamic processes underlying resource allocation remain poorly understood. In this study, we employed computational modeling to investigate the temporal dynamics of resource sharing in a novel group decision-making task across three experiments. We found that, beyond the well-documented reciprocity, participants exhibited consistent alternating behavior, characterized by switching between potential recipients. This alternation was not driven by fairness concerns but reflected a strategic balance between maintaining stable partnerships and exploring alternatives. Crucially, a reinforcement learning model incorporating Theory of Mind (ToM) consistently outperformed all alternative models. These findings highlight the critical role of ToM in social decision-making and suggest that mentalizing others' intentions may be essential for effective resource sharing and social bond formation.
Zhu, T.; Tashevski, A.; Taquet, M.; Azis, M.; Jani, T.; Broome, M. R.; Kabir, T.; Minichino, A.; Murray, G. K.; Nour, M. M.; Singh, I.; Fusar-Poli, P.; Nevado-Holgado, A.; McGuire, P.; Oliver, D.
Psychosis prevention relies on early detection of individuals at clinical high risk for psychosis (CHR-P), yet detection remains limited, constraining preventive care. The effectiveness of the CHR-P paradigm is constrained in part because clinical assessments require specialist interpretation of narrative interviews, limiting scalability. Here, we evaluate whether large language models (LLMs; deep learning models trained on large text corpora to process and generate language) can extract clinically meaningful information from such interviews to support psychosis risk assessment. We assessed 11 open-weight LLMs on 678 PSYCHS interview transcripts from 373 participants (77.7% CHR-P). Models inferred CHR-P status and estimated severity and frequency across 15 symptom domains, benchmarked against researcher-rated scores. Larger models achieved the strongest classification performance (Llama-3.3-70B: accuracy = 0.80, sensitivity = 0.93, specificity = 0.58). LLM-generated symptom scores showed good correlations with researcher-rated scores (ICCsev = 0.74, ICCfreq = 0.75). Performance disparities were minimal across most demographic groups but varied across sites. Generated summaries were largely faithful to source transcripts, with low rates of clinically relevant confabulation (3%). Errors primarily reflected over-pathologisation of non-clinical experiences. While accuracy scaled with model size, smaller models achieved competitive performance with substantially lower computational cost. These findings demonstrate that open-weight LLMs can assess psychosis risk from clinical interview transcripts, supporting scalable, human-in-the-loop approaches to early detection.
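For readers comparing the reported classification metrics, the relationship between accuracy, sensitivity, and specificity can be made explicit with a small helper. The confusion-matrix counts below are hypothetical illustrations, not the study's data; they are chosen only so the sensitivity and specificity match the reported 0.93 and 0.58.

```python
def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float, float]:
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)  # recall on true positives (here, CHR-P)
    specificity = tn / (tn + fp)  # recall on true negatives (non-CHR-P)
    return accuracy, sensitivity, specificity

# Illustrative counts: 100 positives, 100 negatives.
acc, sens, spec = binary_metrics(tp=93, fn=7, tn=58, fp=42)
```

Note that overall accuracy depends on class prevalence as well as on sensitivity and specificity, which is why accuracy alone is uninformative in an enriched sample such as this one (77.7% CHR-P).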
Lim, A.; Pemberton, J.
Background: The NHS Improving Access to Psychological Therapies (IAPT) programme, now rebranded as NHS Talking Therapies, faces persistent capacity constraints with average wait times exceeding 90 days for cognitive behavioral therapy (CBT) in many Clinical Commissioning Group areas. AI-powered CBT platforms have been introduced as a digital adjunct within stepped care, yet longitudinal evidence on anxiety symptom trajectories and their predictors in routine NHS settings remains limited. Objective: To model individual anxiety symptom trajectories among patients referred to an AI-powered CBT platform within NHS primary care, identify distinct trajectory classes, and examine patient-level and practice-level predictors of differential treatment response using multilevel growth curve modeling. Methods: A prospective cohort study was conducted using linked clinical and administrative data from 6,284 patients (aged 18-65) referred to the CalmLogic AI-CBT platform across 187 general practices in four NHS England Integrated Care Systems (ICSs) between April 2023 and September 2025. Patients completed GAD-7 assessments at baseline, 4 weeks, 8 weeks, 12 weeks, and 24 weeks. Three-level growth curve models (assessments nested within patients nested within practices) with random intercepts and random slopes were fitted. Growth mixture modeling (GMM) was subsequently applied to identify latent trajectory classes. Predictors were examined at Level 2 (patient demographics, baseline severity, comorbidities, digital literacy, engagement intensity) and Level 3 (practice deprivation index, list size, urban/rural classification, and IAPT wait time). Results: The unconditional growth model revealed a significant average linear decline in GAD-7 scores of -0.94 points per month (p < .001), with substantial between-patient variation in both intercepts (variance = 14.82, p < .001) and slopes (variance = 0.38, p < .001). 
Significant between-practice variation accounted for 8.7% of intercept variance (ICC = 0.087). Growth mixture modeling identified four distinct trajectory classes: Rapid Responders (28.4%, steep early decline stabilising by week 8); Gradual Improvers (34.1%, steady linear decline through 24 weeks); Partial Responders (22.8%, modest early improvement followed by a plateau at clinically significant levels); and Non-Responders (14.7%, minimal change or slight deterioration). Higher baseline severity, female gender, and greater module completion predicted membership in the Rapid Responder class. Practice-level IAPT wait times exceeding 90 days independently predicted faster improvement trajectories (coefficient = -0.31, p = .003), suggesting that AI-CBT has its greatest incremental value in capacity-constrained areas. Patients in the most deprived quintile showed slower trajectories (coefficient = 0.22, p = .011) despite equivalent engagement levels, indicating a deprivation-related treatment response gap. Conclusions: AI-powered CBT platforms integrated within NHS primary care produce significant anxiety symptom reduction on average, but treatment response is heterogeneous, with four distinct trajectory classes identified. The finding that longer IAPT wait times predict better AI-CBT outcomes supports the platform's positioning as a scalable bridge intervention for capacity-constrained services. The deprivation-related response gap warrants targeted support strategies for patients in the most disadvantaged communities.
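The core of the unconditional growth model described above (a random-intercept, random-slope decline in GAD-7 over months) can be sketched with a linear mixed model. This is a two-level simplification on simulated data; the sample size, variance components, and a mean slope near the reported -0.94 points per month are assumptions for illustration, and the practice-level third level is omitted.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate a GAD-7-style panel: repeated assessments nested within patients,
# with patient-specific intercepts and monthly slopes (all numbers hypothetical).
rng = np.random.default_rng(42)
months = np.array([0, 1, 2, 3, 6])  # baseline, 4/8/12/24 weeks, approximately
rows = []
for pid in range(200):
    b0 = 14.0 + rng.normal(scale=3.8)    # patient-specific baseline severity
    b1 = -0.94 + rng.normal(scale=0.6)   # patient-specific monthly slope
    for t in months:
        rows.append({"patient": pid, "month": t,
                     "gad7": b0 + b1 * t + rng.normal(scale=2.0)})
df = pd.DataFrame(rows)

# Random-intercept, random-slope growth model: gad7 ~ month, grouped by patient.
result = smf.mixedlm("gad7 ~ month", df, groups="patient",
                     re_formula="~month").fit()
slope = result.fe_params["month"]  # average linear change per month
```

The fitted fixed effect for `month` recovers the average decline, while the estimated random-effects covariance captures the between-patient variation in intercepts and slopes that motivates the subsequent growth mixture modeling.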